Expanding Arabic Treebank to Speech: Results from Broadcast News
Abstract
Initial hamza:
• BN: All initial hamzas (glottal stops) are heard and transcribed with either a-, such as إِن an~a (that)
• Newswire (NW): Neutralized ان An form very common (1.5% of tokens in ATB3)
• Annotators forced to distinguish between a- forms based on context
• The two forms require different POS and tree annotations, different guidelines
For initial hamza, transcribed speech data actually presents fewer issues for downstream annotation than written NW data.
Status of BN Corpus Integration with SAMA
• Status flag for each source token to make explicit the connection between morphological analysis from the Standard Arabic Morphological Analyzer (SAMA) and ATB POS annotation
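The neutralization described for newswire text can be illustrated with a small sketch. It assumes the two underlying forms are the hamza-above and hamza-below spellings (written here as Buckwalter `>an~a` and `<in~a`; the second form is an assumption, since the abstract names only `an~a`), and it uses a hypothetical helper, `neutralize_initial_hamza`, that is not part of any ATB or SAMA tooling. The point is only that both spellings collapse to the same bare-alif `An` string, which is why annotators must recover the distinction from context.

```python
import re

# Hypothetical illustration; not part of ATB or SAMA tooling.
# Map hamzated alif (U+0623 hamza above, U+0625 hamza below) to bare alif (U+0627).
HAMZATED_ALIF = {"\u0623": "\u0627", "\u0625": "\u0627"}

# Arabic short-vowel and gemination marks: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def neutralize_initial_hamza(word: str) -> str:
    """Strip diacritics and replace a word-initial hamzated alif with bare alif."""
    bare = DIACRITICS.sub("", word)
    if bare and bare[0] in HAMZATED_ALIF:
        bare = HAMZATED_ALIF[bare[0]] + bare[1:]
    return bare


# Fully vocalized forms, written with escapes to avoid bidirectional display issues.
form_hamza_above = "\u0623\u064E\u0646\u0651\u064E"  # Buckwalter >an~a
form_hamza_below = "\u0625\u0650\u0646\u0651\u064E"  # Buckwalter <in~a (assumed second form)

neutral_above = neutralize_initial_hamza(form_hamza_above)
neutral_below = neutralize_initial_hamza(form_hamza_below)

# Both collapse to alif + nun, the neutralized "An" spelling.
print(neutral_above == neutral_below)
```

In transcribed speech the hamza is heard and written, so this collapse does not occur and the two forms stay distinct for POS and tree annotation.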
Similar resources
From Speech to Trees: Applying Treebank Annotation to Arabic Broadcast News
The Arabic Treebank (ATB) Project at the Linguistic Data Consortium (LDC) has embarked on a large corpus of Broadcast News (BN) transcriptions, and this has led to a number of new challenges for the data processing and annotation procedures that were originally developed for Arabic newswire text (ATB1, ATB2 and ATB3). The corpus requirements currently posed by the DARPA GALE Program, including ...
The need to create a media block for the convergence of overseas news networks
As a general diplomacy arm of the Islamic Republic of Iran, VoSiMa has extensive activities in international broadcasting of its radio and television programs. These programs are broadcast in different languages, such as English, French, Azeri, Arabic, and ... for regional and transnational audiences. The large volume of the organization's international activities is in the form of news and new...
Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus
Broadcast news is a very rich source of Language Resources that has been exploited to develop and assess a large set of Human Language Technologies. Some examples include systems to: automatically produce text transcriptions of spoken data; identify the language of a text; translate a text from one language to another; identify topics in the news and retrieve all stories discussing a target top...
Quick Rich Transcriptions of Arabic Broadcast News Speech Data
This paper describes the collection and transcription of a large set of Arabic broadcast news speech data. A total of more than 2000 hours of data was transcribed. The transcription factor for transcribing the broadcast news data has been reduced using a method such as Quick Rich Transcription (QRTR) as well as reducing the number of quality controls performed on the data. The data was collected f...
VOXALEAD: A Scalable Video Search Engine Based On Content
Most news organizations provide immediate access to topical news broadcasts through RSS streams or podcasts. Until recently, applications have not permitted a user to perform content based search within a longer spoken broadcast to find the segment that might interest them. Recent progress in both automatic speech recognition (ASR) and natural language processing (NLP) has produced robust tools...